Task 3: Time-Based Analysis¶

Introduction¶

The objective of this task is to analyze temporal trends in the dataset and develop insights or models based on time-series features. The data is first explored to identify patterns related to time; time-series techniques are then applied to uncover temporal insights, and the findings are interpreted to draw conclusions about the data's time-dependent behavior.

Background¶

The information presented in this report is gathered from the following sources:

  • Information outlined in the project requirements document;
  • Details provided on Kaggle;
  • Documentation and code traced back through GitHub commits.

Before diving into the analysis, it is essential to understand the nature of the data. This step is critical as it guides actions such as:

  • Making reasonable assumptions about the data;
  • Handling duplicated and missing values;
  • Interpreting and understanding the results of this report.

Data¶

The dataset has the following characteristics:

  • Original source: The data comes from Otomoto.pl, a popular Polish car sales platform. It consists of self-reported information from individuals and agencies. Most fields are filled using dropdown menus, while numeric fields allow users to input their values. The platform also offers unstructured data, such as images and detailed car descriptions, though these are not included in the dataset.

  • Method of collection: The dataset was scraped from the Otomoto.pl website by a student at Warsaw's Military University of Technology as part of their coursework. It represents a snapshot of the platform's data at a single point in time: May 5, 2021.

  • Scope: The dataset includes 208,304 observations across 25 variables.

  • Timeline: Offer publication dates range from March 26, 2021 to May 5, 2021.

1. Data Preparation and EDA¶

1.a. Duplicated and Missing Values¶

Before proceeding with any further work, it is essential to ensure that any duplicate values are removed from the dataset. In a business context, this refers to ads that contain identical information. These duplicates typically arise when the website logic fails to filter out identical listings. Below are some key statistics related to this matter:
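As a minimal sketch of the deduplication step (the toy data and column names are illustrative, not the actual dataset):

```python
import pandas as pd

# Toy listings standing in for the scraped data (column names are illustrative)
df = pd.DataFrame({
    "Price": [10000, 10000, 25000],
    "Mileage_km": [150000, 150000, 60000],
    "Condition": ["Used", "Used", "New"],
})

n_before = len(df)
df = df.drop_duplicates()  # drop ads whose every field is identical
n_dupes = n_before - len(df)
print(f"Removed {n_dupes} duplicated ad(s); {len(df)} remain")
```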

Next, we will evaluate the completeness of the data by measuring the number of non-empty values for each variable. I have categorized the variables into three groups as follows:

  • Green: Fully usable variables.
  • Yellow: Variables with an acceptable level of completeness, where it is reasonable to remove NAs and proceed.
  • Red: Variables with an unacceptable level of completeness, requiring removal.
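The traffic-light grouping can be sketched as follows; the thresholds here are illustrative assumptions, not the ones used in the actual report:

```python
import pandas as pd

# Toy data with varying completeness (columns are illustrative)
df = pd.DataFrame({
    "Price": [10000, 25000, 18000, 9000],
    "Drive": ["FWD", None, "RWD", "FWD"],
    "CO2_emissions": [None, None, None, 120.0],
})

completeness = df.notna().mean()  # share of non-empty values per variable

# Hypothetical cut-offs for the green/yellow/red grouping
def bucket(share, green=0.99, yellow=0.70):
    if share >= green:
        return "green"
    return "yellow" if share >= yellow else "red"

groups = completeness.apply(bucket)
print(groups)
```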

1.b. Pre-Processing and Feature Engineering¶

At this stage, I will categorize the variables into two groups: numeric and categorical. Based on the variable type, I will perform the following actions:

  • Numeric variables: Pre-process and proceed with all meaningful variables.
  • Categorical variables: Identify variables with a low number of levels and either apply one-hot encoding or transform them into a numeric format. As you may have noticed, I am placing significant emphasis on converting all variables into numeric format, as this is a requirement for certain dimensionality reduction and clustering methods.

For the categorical variables mentioned above, we will focus on those with the fewest levels or classes: Drive, Condition, and Transmission. The fewer the levels, the easier a variable is to convert into interpretable dummy variables:

  • Condition and Transmission: Suitable for one-hot encoding.
  • Drive: Contains more levels than the other two. I may consider combining some levels into a new category called 4x4 (all), but first, I will examine the behavior within the existing classes.
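The one-hot encoding of the two low-cardinality variables can be sketched with pandas (toy values; the real data has more rows and levels):

```python
import pandas as pd

# Toy subset of the categorical variables
df = pd.DataFrame({
    "Condition": ["New", "Used", "Used"],
    "Transmission": ["Manual", "Automatic", "Manual"],
})

# One dummy column per level of each variable
dummies = pd.get_dummies(df, columns=["Condition", "Transmission"])
print(list(dummies.columns))
```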

Additionally, some categorical variables can be converted into numeric format for improved usability and comparability:

  • Features: Transform into a numeric variable, Number_of_features.
  • Offer_publication_date: Convert into Days_on_market.
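A sketch of both conversions, assuming Features is a delimited string of equipment items and taking May 5, 2021 as the snapshot date:

```python
import pandas as pd

df = pd.DataFrame({
    "Features": ["ABS,Airbag,GPS", "ABS", None],
    "Offer_publication_date": ["2021-04-05", "2021-05-01", "2021-03-26"],
})

# Count the comma-separated feature entries; missing -> 0
df["Number_of_features"] = (
    df["Features"].str.split(",").str.len().fillna(0).astype(int)
)

# Days between publication and the snapshot date
snapshot = pd.Timestamp("2021-05-05")
df["Days_on_market"] = (
    snapshot - pd.to_datetime(df["Offer_publication_date"])
).dt.days
print(df[["Number_of_features", "Days_on_market"]])
```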

For numeric values, there are some quick wins in transformation activities. I am applying the following:

  • Price: Using the Currency column, convert to Price_in_CAD for improved interpretability.
  • Production_year: Transform into Vehicle_age for easier interpretation.
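Both transformations sketched below; the exchange rates are made-up placeholders (the real conversion would use actual rates), and 2021 is the snapshot year:

```python
import pandas as pd

df = pd.DataFrame({
    "Price": [50000.0, 12000.0],
    "Currency": ["PLN", "EUR"],
    "Production_year": [2015, 2020],
})

# Hypothetical exchange rates to CAD, keyed by the Currency column
rates_to_cad = {"PLN": 0.33, "EUR": 1.45}
df["Price_in_CAD"] = df["Price"] * df["Currency"].map(rates_to_cad)

# Age relative to the 2021 snapshot
df["Vehicle_age"] = 2021 - df["Production_year"]
print(df[["Price_in_CAD", "Vehicle_age"]])
```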

It is important to note that the data is self-reported, which means we may question the reasonableness of certain values. The variables outlined below have been capped to remove extremely high values that could introduce noise into the data. Some high values were retained where the distribution appeared to follow a continuous scale. For capping purposes, the following limits were applied:

  • Mileage_km: Capped at 1,000,000.
  • Doors_number: Capped at 6.
  • Price_in_CAD: Capped at 1,000,000.
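The capping step is a one-liner per column with pandas' clip (toy values chosen to trigger each cap):

```python
import pandas as pd

df = pd.DataFrame({
    "Mileage_km": [120_000, 2_500_000],
    "Doors_number": [4, 11],
    "Price_in_CAD": [30_000.0, 5_000_000.0],
})

# Cap self-reported values at the limits listed above
caps = {"Mileage_km": 1_000_000, "Doors_number": 6, "Price_in_CAD": 1_000_000}
for col, cap in caps.items():
    df[col] = df[col].clip(upper=cap)
print(df.max())
```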

1.c. Temporal EDA¶

Now that the data has been cleaned, this subsection explores the temporal characteristics of the dataset.

It is clear that the data is not evenly distributed by the offer listing date, which we anticipated. The data was collected by "taking a snapshot" of the website on May 5, 2021, likely around midnight. Older listings are not present, as they are deleted once the offer is settled. On average, vehicles take about a month to be sold.

3. Time-Series Analysis¶

Given the short period of quality data (less than a month), there is limited opportunity to model seasonality. However, we can explore the autocorrelation function to examine how the previous day's average price influences the next day's average. Additionally, statsmodels allows us to assess correlations at various time lags, such as a few days or a week.
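The per-lag correlations can be sketched directly in pandas on a synthetic series standing in for the real daily average prices (statsmodels' `acf`/`plot_acf` would add confidence bands on top of the same lags):

```python
import numpy as np
import pandas as pd

# Synthetic daily mean prices standing in for the real series (illustrative)
rng = np.random.default_rng(0)
daily_mean = pd.Series(
    45_000 + rng.normal(0, 500, 30).cumsum(),
    index=pd.date_range("2021-04-05", periods=30, freq="D"),
)

# Pearson autocorrelation at lags 1..7 (up to one week)
autocorr = {lag: daily_mean.autocorr(lag=lag) for lag in range(1, 8)}
for lag, val in autocorr.items():
    print(f"lag {lag}: {val:+.2f}")
```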


4. Interpretation and Insights¶

Overall, the nature of the data makes it unsuitable for time-series analysis. The data was collected through web scraping at a specific point in time. Since this is an ad listing website, once a deal is settled, the ad is typically removed, and the data point no longer exists. A potential solution would be to perform web scraping at regular intervals (e.g., weekly) to ensure that data loss is minimized.